ABSTRACT. In their seminal work, Alon, Matias, and Szegedy introduced several sketching techniques,
including showing that 4-wise independence is sufficient to obtain good approximations of the
second frequency moment. In this work, we show that their sketching technique can be extended
to product domains $[n]^k$ by using the product of 4-wise independent functions on $[n]$. Our work
extends that of Indyk and McGregor, who showed the result for $k = 2$. Their primary motivation was
the problem of identifying correlations in data streams. In their model, a stream of pairs
$(i, j) \in [n]^2$ arrives, giving a joint distribution $(X, Y)$, and they find approximation
algorithms for how close the joint distribution is to the product of the marginal distributions
under various metrics, which naturally corresponds to how close $X$ and $Y$ are to being
independent. By using our technique, we obtain a new result for the problem of approximating the
$\ell_2$ distance between the joint distribution and the product of the marginal distributions for
$k$-ary vectors, instead of just pairs, in a single pass. Our analysis gives a randomized algorithm
that is a $(1 \pm \epsilon)$ approximation (with probability $1 - \delta$) that requires space
logarithmic in $n$ and $m$ and proportional to $3^k$.

1. Introduction

In their seminal work, Alon, Matias, and Szegedy [4] presented celebrated sketching techniques
and showed that 4-wise independence is sufficient to obtain good approximations of the second
frequency moment. Indyk and McGregor [12] make use of this technique in their work introducing
the problem of measuring independence in the streaming model. There they give efficient
algorithms for approximating pairwise independence for the $\ell_1$ and $\ell_2$ norms. In their
model, a stream of pairs $(i, j) \in [n]^2$ arrives, giving a joint distribution $(X, Y)$, and the
notion of approximating pairwise independence corresponds to approximating the distance between
the joint distribution and the product of the marginal distributions for the pairs. Indyk and
McGregor state, as an explicit open question in their paper, the problem of whether one can
estimate $k$-wise independence on $k$-tuples for any $k > 2$. In particular, Indyk and McGregor
show that, for the $\ell_2$ norm, they can make use of the product of 4-wise independent functions
on $[n]$ in the sketching method of Alon, Matias, and Szegedy. We extend their approach to show
that on the product domain $[n]^k$, the sketching method of Alon, Matias, and Szegedy works when
using the product of $k$ copies of 4-wise independent functions on $[n]$. The cost is that the
memory requirements of our approach grow exponentially with $k$, proportionally to $3^k$.

Measuring independence and $k$-wise independence is a fundamental problem with many
applications (see, e.g., Lehmann [13]). Recently, this problem was also addressed in other models
by, among others, Alon, Andoni, Kaufman, Matulef, Rubinfeld and Xie [1]; Batu, Fortnow, Fischer,
Kumar, Rubinfeld and White [5]; Goldreich and Ron [11]; Batu, Kumar and Rubinfeld [6]; Alon,
Goldreich and Mansour [3]; and Rubinfeld and Servedio [15]. Traditional non-parametric methods
of testing independence over empirical data usually require space that is polynomial in either
the support size or the input size. The scale of contemporary data sets often prohibits such
space complexity. It is therefore natural to ask whether we can design algorithms to test
for independence in the streaming model. Interestingly, this specific problem appears not to have
been introduced until the work of Indyk and McGregor. While arguably results for the $\ell_1$ norm
would be stronger than for the $\ell_2$ norm in this setting, the problem for the $\ell_2$ norm is
interesting in its own right. The problem for the $\ell_1$ norm has recently been resolved by
Braverman and Ostrovsky [8]. They gave a $(1 \pm \epsilon, \delta)$-approximation algorithm that
makes a single pass over a data stream and uses polylogarithmic memory.

1.1. Our Results 

In this paper we generalize the "sketching of sketches" result of Indyk and McGregor. Our 
specific theoretical contributions can be summarized as follows: 

Main Theorem. 

Let $v \in \mathbb{R}^{n^k}$ be a vector with entries $v_p \in \mathbb{R}$ for $p \in [n]^k$. Let
$h_1, \ldots, h_k : [n] \to \{-1, 1\}$ be independent copies of 4-wise independent hash functions;
that is, $h_i(1), \ldots, h_i(n) \in \{-1, 1\}$ are 4-wise independent for each $i \in [k]$, and
$h_1(\cdot), \ldots, h_k(\cdot)$ are mutually independent. Define $H(p) = \prod_{j=1}^{k} h_j(p_j)$,
and the sketch $Y = \sum_{p \in [n]^k} v_p H(p)$.

We prove that the sketch $Y$ can be used to give an efficient approximation for $\|v\|^2$; our
result is stated formally in Theorem 4.2. Note that $H$ is not 4-wise independent.

As a corollary, the main application of our main theorem is to extend the result of Indyk and
McGregor [12] to detect the dependency of $k$ random variables in the streaming model.

Corollary 1.1. For every $\epsilon > 0$ and $\delta > 0$, there exists a randomized algorithm that
computes, given a sequence $a_1, \ldots, a_m$ of $k$-tuples, in one pass and using
$O(3^k \epsilon^{-2} \log\frac{1}{\delta} (\log m + \log n))$ memory bits, a number $Y$ so that the
probability $Y$ deviates from the $\ell_2$ distance between product and joint distribution by more
than a factor of $(1 + \epsilon)$ is at most $\delta$.



1.2. Techniques and a Historical Remark 

This paper is a merge of [7, 9, 10], where the same result was obtained with different proofs.
The proof of [10] generalizes the geometric approach of Indyk and McGregor [12] with new
geometric observations. The proofs of [7, 9] are more combinatorial in nature. These papers offer
new insights, but due to space limitations, we focus on the proof from [9] in this paper. The
original papers are available online and are recommended to the interested reader.



2. The Model 

We provide the general underlying model. Here we mostly follow the notation of [7, 12].

Let $S$ be a stream of size $m$ with elements $a_1, \ldots, a_m$, where
$a_i = (a_i^1, \ldots, a_i^k) \in [n]^k$. (When we have a sequence of elements that are themselves
vectors, we denote the sequence number by a subscript and the vector entry by a superscript when
both are needed.) The stream $S$ defines an empirical distribution over $[n]^k$ as follows: the
frequency $f(\omega)$ of an element $\omega \in [n]^k$ is defined as the number of times it appears
in $S$, and the empirical distribution is

$$\Pr[\omega] = \frac{f(\omega)}{m} \quad \text{for any } \omega \in [n]^k.$$

Since $\omega = (\omega_1, \ldots, \omega_k)$ is a vector of size $k$, we may also view the
streaming data as defining a joint distribution over the random variables $X_1, \ldots, X_k$
corresponding to the values in each dimension. (In the case of $k = 2$, we write the random
variables as $X$ and $Y$ rather than $X_1$ and $X_2$.) There is a natural way of defining the
marginal distribution for the random variable $X_i$: for $\omega_i \in [n]$, let $f_i(\omega_i)$ be
the number of times $\omega_i$ appears in the $i$th coordinate of an element of $S$, or

$$f_i(\omega_i) = |\{a_j \in S : a_j^i = \omega_i\}|.$$

The empirical marginal distribution $\Pr_i[\cdot]$ for the $i$th coordinate is defined as

$$\Pr\nolimits_i[\omega_i] = \frac{f_i(\omega_i)}{m} \quad \text{for any } \omega_i \in [n].$$

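Next let $v$ be the vector in $\mathbb{R}^{n^k}$ with
$v_\omega = \Pr[\omega] - \prod_{1 \le i \le k} \Pr_i[\omega_i]$ for all $\omega \in [n]^k$. Our
goal is to approximate the value

$$\|v\| = \left( \sum_{\omega \in [n]^k} \Bigl( \Pr[\omega] - \prod_{1 \le i \le k} \Pr\nolimits_i[\omega_i] \Bigr)^2 \right)^{1/2}. \tag{2.1}$$

This represents the $\ell_2$ norm of the difference between the tensor of the marginal
distributions and the joint distribution, which we would expect to be close to zero in the case
where the $X_i$ are truly independent.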
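
As a concrete reference point for Equation (2.1), the following minimal sketch computes the
quantity exactly by brute force over all of $[n]^k$. It is only meant for checking small examples,
not as a streaming algorithm: it uses space exponential in $k$. The function name
`l2_independence_distance` and the 0-based coordinates `range(n)` in place of $[n]$ are
assumptions made for illustration.

```python
from collections import Counter
from itertools import product
from math import prod, sqrt


def l2_independence_distance(stream, n):
    """Exact ||v|| from Equation (2.1) for a stream of k-tuples over range(n)."""
    m = len(stream)
    k = len(stream[0])
    joint = Counter(stream)                      # frequencies f(omega)
    marginals = [Counter(a[i] for a in stream)   # frequencies f_i(omega_i)
                 for i in range(k)]
    total = 0.0
    for omega in product(range(n), repeat=k):
        pr_joint = joint[omega] / m
        pr_product = prod(marginals[i][omega[i]] / m for i in range(k))
        total += (pr_joint - pr_product) ** 2
    return sqrt(total)
```

For the perfectly correlated stream `[(0, 0), (1, 1)]` with `n = 2`, every entry of $v$ has
magnitude $1/4$, so the function returns $0.5$; for a stream containing each pair exactly once it
returns $0$.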

Finally, our algorithms will assume the availability of 4-wise independent hash functions. For
more on 4-wise independence, including efficient implementations, see [2, 16]. For the purposes of
this paper, the following simple definition will suffice.

Definition 2.1. (4-wise independence) A family of hash functions $\mathcal{H}$ with domain $[n]$
and range $\{-1, 1\}$ is 4-wise independent if for any distinct values $i_1, i_2, i_3, i_4 \in [n]$
and any $b_1, b_2, b_3, b_4 \in \{-1, 1\}$, the following equality holds:

$$\Pr_{h \leftarrow \mathcal{H}} [h(i_1) = b_1, h(i_2) = b_2, h(i_3) = b_3, h(i_4) = b_4] = 1/16.$$

Remark 2.2. In [12], the family of 4-wise independent hash functions $\mathcal{H}$ is called 4-wise
independent random vectors. For consistency within our paper, we will always view the object
$\mathcal{H}$ as a hash function family.
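
One common way to obtain such a family in practice is to evaluate a random polynomial of degree at
most 3 over a prime field and map the result to a sign. The sketch below illustrates that idea
under stated assumptions; it is not the construction used in the cited references. The names
`P` and `make_sign_hash` are hypothetical, and taking the low bit of a value modulo an odd prime
leaves a bias of order $1/P$, so the resulting signs are only approximately uniform rather than
exactly matching Definition 2.1.

```python
import random

P = (1 << 61) - 1  # a Mersenne prime, assumed to be much larger than n


def make_sign_hash(rng):
    """Draw h: {0, ..., P-1} -> {-1, +1} from a random cubic polynomial mod P."""
    coeffs = [rng.randrange(P) for _ in range(4)]

    def h(x):
        acc = 0
        for c in coeffs:              # Horner evaluation modulo P
            acc = (acc * x + c) % P
        return 1 if acc & 1 else -1   # low bit -> sign (bias of order 1/P)

    return h
```

Storing the four coefficients takes $O(\log P)$ bits per hash function, which is the kind of
compact representation the space bounds in this paper rely on.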



3. The Algorithm and its Analysis for k = 2 

We begin by reviewing the approximation algorithm and associated proof for the $\ell_2$ norm given
in [12]. Reviewing this result will allow us to provide the necessary notation and frame the setting 
for our extension to general k. Moreover, in our proof, we find that a constant in Lemma 3.1 
from [12] that we subsequently generalize appears incorrect. (Because of this, our proof is slightly 
different and more detailed than the original.) Although the error is minor in the context of their 
paper (it only affects the constant factor in the order notation), it becomes more important when 
considering the proper generalization to larger k, and hence it is useful to correct here. 

In the case $k = 2$, we assume that the sequence
$(a_1^1, a_1^2), (a_2^1, a_2^2), \ldots, (a_m^1, a_m^2)$ arrives item by item. Each
$(a_i^1, a_i^2)$ (for $1 \le i \le m$) is an element in $[n]^2$. The random variables $X$ and $Y$
over $[n]$ can be expressed as follows:

$$\Pr[i, j] = \Pr[X = i, Y = j] = |\{\ell : (a_\ell^1, a_\ell^2) = (i, j)\}| / m,$$
$$\Pr\nolimits_1[i] = \Pr[X = i] = |\{\ell : (a_\ell^1, a_\ell^2) = (i, \cdot)\}| / m,$$
$$\Pr\nolimits_2[j] = \Pr[Y = j] = |\{\ell : (a_\ell^1, a_\ell^2) = (\cdot, j)\}| / m.$$

We simplify the notation and use $p_i = \Pr[X = i]$, $q_j = \Pr[Y = j]$,
$r_{i,j} = \Pr[X = i, Y = j]$, and $v_{i,j} = r_{i,j} - p_i q_j$.

Indyk and McGregor's algorithm proceeds in a similar fashion to the streaming algorithm
presented in [4]. Specifically, let $s_1 = 72\epsilon^{-2}$ and $s_2 = 2 \log(1/\delta)$. The
algorithm computes $s_2$ random variables $Y_1, Y_2, \ldots, Y_{s_2}$ and outputs their median. The
output is the algorithm's estimate of the norm of $v$ defined in Equation (2.1). Each $Y_i$ is the
average of $s_1$ random variables $Y_{ij}$, $1 \le j \le s_1$, where the $Y_{ij}$ are independent,
identically distributed random variables. Each of the variables $Y = Y_{ij}$ can be computed from
the algorithmic routine shown in Figure 1.

2-D APPROXIMATION $((a_1^1, a_1^2), \ldots, (a_m^1, a_m^2))$

Independently generate 4-wise independent random functions $h_1, h_2$ from $[n]$ to $\{-1, 1\}$.
Initialize $t_1, t_2, t_3 \leftarrow 0$.
for $c \leftarrow 1$ to $m$
    do Let the $c$th item be $(a_c^1, a_c^2) = (i, j)$.
       $t_1 \leftarrow t_1 + h_1(i) h_2(j)$, $t_2 \leftarrow t_2 + h_1(i)$, $t_3 \leftarrow t_3 + h_2(j)$.
Return $Y = (t_1/m - t_2 t_3/m^2)^2$.

Figure 1: The procedure for generating random variable $Y$ for $k = 2$.
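
The routine in Figure 1 translates almost line by line into code. The following is a minimal
sketch, assuming the two hash functions are supplied by the caller as ordinary callables returning
$\pm 1$; the name `two_d_approximation` and this calling convention are illustrative choices, not
part of the original presentation.

```python
def two_d_approximation(stream, h1, h2):
    """One sample of Y from Figure 1 for a stream of pairs (i, j)."""
    t1 = t2 = t3 = 0   # the three running counters of Figure 1
    m = 0              # stream length, counted on the fly
    for i, j in stream:
        t1 += h1(i) * h2(j)
        t2 += h1(i)
        t3 += h2(j)
        m += 1
    return (t1 / m - t2 * t3 / m**2) ** 2
```

Only three counters and the stream length are kept, so apart from the hash function descriptions
the state is $O(\log m)$ bits. A stream that repeats a single pair has independent (constant)
coordinates, and the routine returns exactly $0$ for it regardless of the hash functions.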

By the end of the process 2-D APPROXIMATION, we have
$t_1/m = \sum_{i,j \in [n]} h_1(i) h_2(j) r_{i,j}$, $t_2/m = \sum_{i \in [n]} h_1(i) p_i$, and
$t_3/m = \sum_{j \in [n]} h_2(j) q_j$. Also, when a vector is in $\mathbb{R}^{n^2}$, its indices can
be represented by $(i_1, i_2) \in [n]^2$. In what follows, we will use a bold letter to represent
the index of a high dimensional vector, e.g., $v_{\mathbf{i}} = v_{i_1, i_2}$. The following lemma
shows that the expectation of $Y$ is $\|v\|^2$ and the variance of $Y$ is at most $8(E[Y])^2$
because $E[Y^2] \le 9E[Y]^2$.

Lemma 3.1. ([12]) Let $h_1, h_2$ be two independent instances of 4-wise independent hash functions
from $[n]$ to $\{-1, 1\}$. Let $v \in \mathbb{R}^{n^2}$ and
$H(\mathbf{i}) (= H((i_1, i_2))) = h_1(i_1) \cdot h_2(i_2)$. Let us define
$Y = \bigl(\sum_{\mathbf{i} \in [n]^2} H(\mathbf{i}) v_{\mathbf{i}}\bigr)^2$. Then
$E[Y] = \sum_{\mathbf{i} \in [n]^2} v_{\mathbf{i}}^2$ and $E[Y^2] \le 9(E[Y])^2$, which implies
$\mathrm{Var}[Y] \le 8E^2[Y]$.

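Proof. We have $E[Y] = E[(\sum_{\mathbf{i}} H(\mathbf{i}) v_{\mathbf{i}})^2] = \sum_{\mathbf{i}} v_{\mathbf{i}}^2 E[H^2(\mathbf{i})] + \sum_{\mathbf{i} \ne \mathbf{j}} v_{\mathbf{i}} v_{\mathbf{j}} E[H(\mathbf{i}) H(\mathbf{j})]$.
For all $\mathbf{i} \in [n]^2$, we know $H^2(\mathbf{i}) = 1$. On the other hand,
$H(\mathbf{i}) H(\mathbf{j}) \in \{-1, 1\}$. When $\mathbf{i}$ and $\mathbf{j}$ differ in both
coordinates, the probability that $H(\mathbf{i}) H(\mathbf{j}) = 1$ is
$\Pr[H(\mathbf{i}) H(\mathbf{j}) = 1] = \Pr[h_1(i_1) h_1(j_1) h_2(i_2) h_2(j_2) = 1] = 1/16 + \binom{4}{2} 1/16 + 1/16 = 1/2$.
The last equality holds because $h_1(i_1) h_1(j_1) h_2(i_2) h_2(j_2) = 1$ is equivalent to
saying either all these variables are 1, or exactly two of these variables are $-1$, or all these
variables are $-1$. (When $\mathbf{i}$ and $\mathbf{j}$ agree in one coordinate, the product
reduces to two distinct independent variables and the same parity argument again gives
probability $1/2$.) Therefore, $E[H(\mathbf{i}) H(\mathbf{j})] = 0$. Consequently,
$E[Y] = \sum_{\mathbf{i} \in [n]^2} (v_{\mathbf{i}})^2$.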

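Now we bound the variance. Recall that $\mathrm{Var}[Y] = E[Y^2] - E[Y]^2$; we bound

$$E[Y^2] = \sum_{\mathbf{i},\mathbf{j},\mathbf{k},\mathbf{l} \in [n]^2} E[H(\mathbf{i}) H(\mathbf{j}) H(\mathbf{k}) H(\mathbf{l})] \, v_{\mathbf{i}} v_{\mathbf{j}} v_{\mathbf{k}} v_{\mathbf{l}} \le \sum_{\mathbf{i},\mathbf{j},\mathbf{k},\mathbf{l} \in [n]^2} |E[H(\mathbf{i}) H(\mathbf{j}) H(\mathbf{k}) H(\mathbf{l})]| \cdot |v_{\mathbf{i}} v_{\mathbf{j}} v_{\mathbf{k}} v_{\mathbf{l}}|.$$

Also $|E[H(\mathbf{i}) H(\mathbf{j}) H(\mathbf{k}) H(\mathbf{l})]| \in \{0, 1\}$. The quantity
$E[H(\mathbf{i}) H(\mathbf{j}) H(\mathbf{k}) H(\mathbf{l})] \ne 0$ if and only if the following
relation holds:

$$\forall s \in [2] : ((i_s = j_s) \wedge (k_s = l_s)) \vee ((i_s = k_s) \wedge (j_s = l_s)) \vee ((i_s = l_s) \wedge (k_s = j_s)). \tag{3.1}$$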

Denote the set of 4-tuples $(\mathbf{i}, \mathbf{j}, \mathbf{k}, \mathbf{l})$ that satisfy the above
relation by $\mathcal{D}$. We may also view each 4-tuple as an ordered set that consists of 4 points
in $[n]^2$. Consider the unique smallest axes-parallel rectangle in $[n]^2$ that contains a given
4-tuple in $\mathcal{D}$ (i.e., contains the four ordered points). Note this could either be a
(degenerate) line segment or a (non-degenerate) rectangle, as we discuss below.
Let $M : \mathcal{D} \to \{A, B, C, D\}$ be the function that maps an element
$\sigma \in \mathcal{D}$ to the smallest rectangle $ABCD$ defined by $\sigma$. Since a rectangle can
be uniquely determined by its diagonals, we may write
$M : \mathcal{D} \to (\chi_1, \chi_2, \varphi_1, \varphi_2)$, where $\chi_1 \le \chi_2 \in [n]$,
$\varphi_1 \le \varphi_2 \in [n]$ and the corresponding rectangle is understood to be the one with
diagonal $\{(\chi_1, \varphi_1), (\chi_2, \varphi_2)\}$. Also, the inverse function
$M^{-1}(\chi_1, \chi_2, \varphi_1, \varphi_2)$ represents the pre-images of
$(\chi_1, \chi_2, \varphi_1, \varphi_2)$ in $\mathcal{D}$. The tuple
$(\chi_1, \chi_2, \varphi_1, \varphi_2)$ is degenerate if either $\chi_1 = \chi_2$ or
$\varphi_1 = \varphi_2$, in which case the rectangle (and its diagonals) correspond to the segment
itself, or $\chi_1 = \chi_2$ and $\varphi_1 = \varphi_2$, and the rectangle is just a single point.

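Example 3.2. Let $\mathbf{i} = (1, 2)$, $\mathbf{j} = (3, 2)$, $\mathbf{k} = (1, 5)$, and
$\mathbf{l} = (3, 5)$. The tuple is in $\mathcal{D}$ and its corresponding bounding rectangle is a
non-degenerate rectangle. The function $M(\mathbf{i}, \mathbf{j}, \mathbf{k}, \mathbf{l}) = (1, 3, 2, 5)$.

Example 3.3. Let $\mathbf{i} = \mathbf{j} = (1, 4)$ and $\mathbf{k} = \mathbf{l} = (3, 7)$. The tuple
is also in $\mathcal{D}$ and the minimal bounding rectangle formed by these points is an interval
$\{(1, 4), (3, 7)\}$. The function $M(\mathbf{i}, \mathbf{j}, \mathbf{k}, \mathbf{l}) = (1, 3, 4, 7)$.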

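To start we consider the non-degenerate cases. Fix any $(\chi_1, \chi_2, \varphi_1, \varphi_2)$ with
$\chi_1 < \chi_2$ and $\varphi_1 < \varphi_2$. There are in total $\binom{4}{2}^2 = 36$ tuples
$(\mathbf{i}, \mathbf{j}, \mathbf{k}, \mathbf{l})$ in $\mathcal{D}$ with
$M(\mathbf{i}, \mathbf{j}, \mathbf{k}, \mathbf{l}) = (\chi_1, \chi_2, \varphi_1, \varphi_2)$.
Twenty-four of these tuples correspond to the setting where none of
$\mathbf{i}, \mathbf{j}, \mathbf{k}, \mathbf{l}$ are equal, as there are twenty-four permutations of
the assignment of the labels $\mathbf{i}, \mathbf{j}, \mathbf{k}, \mathbf{l}$ to the four points.
(This corresponds to the first example.) In this case the four points form a rectangle, and we have
$|v_{\mathbf{i}} v_{\mathbf{j}} v_{\mathbf{k}} v_{\mathbf{l}}| \le \frac{1}{2}\bigl((v_{\chi_1,\varphi_1} v_{\chi_2,\varphi_2})^2 + (v_{\chi_1,\varphi_2} v_{\chi_2,\varphi_1})^2\bigr)$.
Intuitively, in these cases, we assign the "weight" of the tuple to the diagonals.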
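The remaining twelve tuples in $M^{-1}(\chi_1, \chi_2, \varphi_1, \varphi_2)$ correspond to
intervals. (This corresponds to the second example.) In this case two of
$\mathbf{i}, \mathbf{j}, \mathbf{k}, \mathbf{l}$ correspond to one endpoint of the interval, and the
other two labels correspond to the other endpoint. Hence we have either
$|v_{\mathbf{i}} v_{\mathbf{j}} v_{\mathbf{k}} v_{\mathbf{l}}| = (v_{\chi_1,\varphi_1} v_{\chi_2,\varphi_2})^2$
or
$|v_{\mathbf{i}} v_{\mathbf{j}} v_{\mathbf{k}} v_{\mathbf{l}}| = (v_{\chi_1,\varphi_2} v_{\chi_2,\varphi_1})^2$,
and there are six tuples for each case.

Therefore for any $\chi_1 < \chi_2 \in [n]$ and $\varphi_1 < \varphi_2 \in [n]$ we have:

$$\sum_{(\mathbf{i},\mathbf{j},\mathbf{k},\mathbf{l}) \in M^{-1}(\chi_1,\chi_2,\varphi_1,\varphi_2)} |v_{\mathbf{i}} v_{\mathbf{j}} v_{\mathbf{k}} v_{\mathbf{l}}| \le 18 \bigl((v_{\chi_1,\varphi_1} v_{\chi_2,\varphi_2})^2 + (v_{\chi_1,\varphi_2} v_{\chi_2,\varphi_1})^2\bigr).$$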

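The analysis is similar for the degenerate cases, where the constant 18 in the bound above is
now quite loose. When exactly one of $\chi_1 = \chi_2$ or $\varphi_1 = \varphi_2$ holds, the size of
$M^{-1}(\chi_1, \chi_2, \varphi_1, \varphi_2)$ is $\binom{4}{2} = 6$, and the resulting intervals
correspond to vertical or horizontal lines. When both $\chi_1 = \chi_2$ and
$\varphi_1 = \varphi_2$, then $|M^{-1}(\chi_1, \chi_2, \varphi_1, \varphi_2)| = 1$. Write
$A = (v_{\chi_1,\varphi_1} v_{\chi_2,\varphi_2})^2$ and $B = (v_{\chi_1,\varphi_2} v_{\chi_2,\varphi_1})^2$
for the two diagonal weights of $(\chi_1, \chi_2, \varphi_1, \varphi_2)$. Following the same
analysis as for the non-degenerate cases, we find

$$\begin{aligned}
\sum_{(\mathbf{i},\mathbf{j},\mathbf{k},\mathbf{l}) \in \mathcal{D}} |v_{\mathbf{i}} v_{\mathbf{j}} v_{\mathbf{k}} v_{\mathbf{l}}|
&= \sum_{\substack{\chi_1 \le \chi_2 \\ \varphi_1 \le \varphi_2}} \ \sum_{(\mathbf{i},\mathbf{j},\mathbf{k},\mathbf{l}) \in M^{-1}(\chi_1,\chi_2,\varphi_1,\varphi_2)} |v_{\mathbf{i}} v_{\mathbf{j}} v_{\mathbf{k}} v_{\mathbf{l}}| \\
&\le 18 \sum_{\substack{\chi_1 < \chi_2 \\ \varphi_1 < \varphi_2}} (A + B) + 3 \sum_{\substack{\text{exactly one of} \\ \chi_1 = \chi_2,\ \varphi_1 = \varphi_2}} (A + B) + \frac{1}{2} \sum_{\substack{\chi_1 = \chi_2 \\ \varphi_1 = \varphi_2}} (A + B) \\
&\le 9 \sum_{\mathbf{i}, \mathbf{j} \in [n]^2} (v_{\mathbf{i}} v_{\mathbf{j}})^2 = 9E^2[Y].
\end{aligned}$$

Finally, we have
$\sum_{\mathbf{i},\mathbf{j},\mathbf{k},\mathbf{l} \in [n]^2} |E[H(\mathbf{i}) H(\mathbf{j}) H(\mathbf{k}) H(\mathbf{l})]| \cdot |v_{\mathbf{i}} v_{\mathbf{j}} v_{\mathbf{k}} v_{\mathbf{l}}| \le \sum_{(\mathbf{i},\mathbf{j},\mathbf{k},\mathbf{l}) \in \mathcal{D}} |v_{\mathbf{i}} v_{\mathbf{j}} v_{\mathbf{k}} v_{\mathbf{l}}| \le 9E^2[Y]$
and $\mathrm{Var}[Y] \le 8E[Y]^2$. ■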

We emphasize the geometric interpretation of the above proof as follows. The goal is to bound
the variance by a constant times
$E^2[Y] = \sum_{\mathbf{i},\mathbf{j} \in [n]^2} (v_{\mathbf{i}} v_{\mathbf{j}})^2$, where the index
set is the set of all possible lines in the plane $[n]^2$ (each line appears twice). We first show
that
$\mathrm{Var}[Y] \le \sum_{(\mathbf{i},\mathbf{j},\mathbf{k},\mathbf{l}) \in \mathcal{D}} |v_{\mathbf{i}} v_{\mathbf{j}} v_{\mathbf{k}} v_{\mathbf{l}}|$,
where the 4-tuple index set corresponds to a set of rectangles in a natural way. The main idea of
[12] is to use inequalities of the form
$|v_{\mathbf{i}} v_{\mathbf{j}} v_{\mathbf{k}} v_{\mathbf{l}}| \le \frac{1}{2}\bigl((v_{\chi_1,\varphi_1} v_{\chi_2,\varphi_2})^2 + (v_{\chi_1,\varphi_2} v_{\chi_2,\varphi_1})^2\bigr)$
to assign the "weight" of each 4-tuple to the diagonals of the corresponding rectangle. The above
analysis shows that 18 copies of all lines are sufficient to accommodate all 4-tuples. While
similar inequalities could also assign the weight of a 4-tuple to the vertical or horizontal edges
of the corresponding rectangle, using vertical or horizontal edges is problematic. The reason is
that there are $\Omega(n^4)$ 4-tuples but only $O(n^3)$ vertical or horizontal edges, so some lines
would receive $\Omega(n)$ weight, requiring $\Omega(n)$ copies. This problem is already noted in
[12].

Our bound here is $E[Y^2] \le 9E^2[Y]$, while in [12] the bound obtained is
$E[Y^2] \le 3E^2[Y]$. There appears to have been an error in the derivation in [12]; some intuition
comes from the following example. We note that $|\mathcal{D}|$ is at least
$\binom{4}{2}^2 \cdot \binom{n}{2}^2 = 9n^2(n-1)^2 = 9n^4 - 18n^3 + 9n^2$. (This counts the number
of non-degenerate 4-tuples.) Now if we set $v_{\mathbf{i}} = 1$ for all $\mathbf{i} \in [n]^2$, we
have $E[Y] = n^2$ and $E[Y^2] \ge |\mathcal{D}| \ge 9n^2(n-1)^2 \approx 9E^2[Y]$ for large $n$,
which suggests $\mathrm{Var}[Y] > 3E^2[Y]$. Again, we emphasize this discrepancy is of little
importance to [12]; the point there is that the variance is bounded by a constant factor times
the square of the expectation. It is here, where we are generalizing to $k \ge 3$, that the exact
constant factor is of some importance.
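
The example can be checked exactly for very small $n$ by enumerating every pair of sign functions.
The sketch below does this for the all-ones vector with $k = 2$; it substitutes fully random sign
functions for the 4-wise independent family (fully random functions are in particular 4-wise
independent), and the name `moment_ratio_uniform` is hypothetical.

```python
from fractions import Fraction
from itertools import product


def moment_ratio_uniform(n):
    """Exact E[Y^2] / E[Y]^2 for k = 2 and v = all-ones, over fully random signs."""
    signs = list(product((-1, 1), repeat=n))   # every function [n] -> {-1, 1}
    first = second = Fraction(0)
    for h1 in signs:
        for h2 in signs:
            # sum_{i,j} h1(i) h2(j) factorizes because v is all-ones
            y = (sum(h1) * sum(h2)) ** 2
            first += y
            second += y * y
    count = len(signs) ** 2
    return (second / count) / (first / count) ** 2
```

Already for $n = 2$ the ratio is $4$, and for $n = 3$ it is $49/9$, so a bound of the form
$E[Y^2] \le 3E^2[Y]$ cannot hold in general, while both values stay below the constant $9$ of
Lemma 3.1.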

Given the bounds on the expectation and variance for the $Y_{ij}$, standard techniques yield a
bound on the performance of our algorithm.

Theorem 3.4. For every $\epsilon > 0$ and $\delta > 0$, there exists a randomized algorithm that
computes, given a sequence $(a_1^1, a_1^2), \ldots, (a_m^1, a_m^2)$, in one pass and using
$O(\epsilon^{-2} \log\frac{1}{\delta} (\log m + \log n))$ memory bits, a number Med so that the
probability Med deviates from $\|v\|^2$ by more than a relative error of $\epsilon$ is at most
$\delta$.

Proof. Recall the algorithm described in the beginning of Section 3: let
$s_1 = 72\epsilon^{-2}$ and $s_2 = 2 \log(1/\delta)$. The algorithm first computes $s_2$ random
variables $Y_1, Y_2, \ldots, Y_{s_2}$ and outputs their median Med, where each $Y_i$ is the average
of $s_1$ random variables $Y_{ij}$, $1 \le j \le s_1$, and the $Y_{ij}$ are independent, identically
distributed random variables computed by Figure 1. By Chebyshev's inequality, we know that for
any fixed $i$,

$$\Pr\bigl(|Y_i - \|v\|^2| \ge \epsilon \|v\|^2\bigr) \le \frac{\mathrm{Var}(Y_i)}{\epsilon^2 \|v\|^4} = \frac{(1/s_1)\mathrm{Var}[Y]}{\epsilon^2 \|v\|^4} \le \frac{(9\epsilon^2/72)\|v\|^4}{\epsilon^2 \|v\|^4} = \frac{1}{8}.$$

Finally, by standard Chernoff bound arguments (see for example Chapter 4 of [14]), the probability
that more than $s_2/2$ of the variables $Y_i$ deviate by more than $\epsilon\|v\|^2$ from
$\|v\|^2$ is at most $\delta$. In case this does not happen, the median Med supplies a good
estimate of the required quantity $\|v\|^2$ as needed. ■
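
The averaging and median steps used in this proof can be sketched in a few lines. The helper below
assumes the $s_1 s_2$ independent samples of $Y$ have already been produced (for instance by
$s_1 s_2$ independent runs of the routine in Figure 1); the name `median_of_means` and the flat
list layout are illustrative assumptions.

```python
from statistics import mean, median


def median_of_means(samples, s1, s2):
    """Median of s2 group averages, each taken over s1 consecutive samples."""
    if len(samples) < s1 * s2:
        raise ValueError("need at least s1 * s2 samples")
    groups = [samples[g * s1:(g + 1) * s1] for g in range(s2)]
    return median(mean(group) for group in groups)
```

Averaging reduces the variance by a factor of $s_1$, and the median makes the estimate robust to
the few groups that still land far from the expectation, as in the outlier group of the call
`median_of_means([1, 1, 1, 2, 2, 2, 100, 100, 100], 3, 3)`.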



4. The General Case $k \ge 3$

Now let us move to the general case where $k \ge 3$. Recall that $v$ is a vector in
$\mathbb{R}^{n^k}$ that maintains certain statistics of a data stream, and we are interested in
estimating its $\ell_2$ norm $\|v\|$. There is a natural generalization of Indyk and McGregor's
method for $k = 2$ to construct an estimator for $\|v\|$: let
$h_1, \ldots, h_k : [n] \to \{-1, 1\}$ be independent copies of 4-wise independent hash functions
(namely, $h_i(1), \ldots, h_i(n) \in \{-1, 1\}$ are 4-wise independent for each $i \in [k]$, and
$h_1(\cdot), \ldots, h_k(\cdot)$ are mutually independent). Let
$H(\mathbf{p}) = \prod_{j=1}^{k} h_j(p_j)$. The estimator $Y$ is defined as
$Y = \bigl(\sum_{\mathbf{p} \in [n]^k} v_{\mathbf{p}} H(\mathbf{p})\bigr)^2$.

Our goal is to show that $E[Y] = \|v\|^2$ and $\mathrm{Var}[Y]$ is reasonably small so that a
streaming algorithm maintaining multiple independent instances of estimator $Y$ will be able to
output an approximately correct estimation of $\|v\|$ with high probability. Notice that when
$\|v\|$ represents the $\ell_2$ distance between the joint distribution and the tensor of the
marginal distributions, the estimator can be computed efficiently in a streaming model similarly
to Figure 1. We stress that our result is applicable to a broader class of $\ell_2$-norm
estimation problems, as long as the vector $v$ to be estimated has a corresponding efficiently
computable estimator $Y$ in an appropriate streaming model. Formally, we shall prove the following
main lemma in the next subsection.

Lemma 4.1. Let $v$ be a vector in $\mathbb{R}^{n^k}$, and $h_1, \ldots, h_k : [n] \to \{-1, 1\}$ be
independent copies of 4-wise independent hash functions. Define
$H(\mathbf{p}) = \prod_{j=1}^{k} h_j(p_j)$, and
$Y = \bigl(\sum_{\mathbf{p} \in [n]^k} v_{\mathbf{p}} H(\mathbf{p})\bigr)^2$.
We have $E[Y] = \|v\|^2$ and $\mathrm{Var}[Y] \le 3^k E[Y]^2$.
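
For tiny parameters the two claims of the lemma can be verified exactly by enumeration. The sketch
below is such a sanity check, not part of the proof: it replaces the 4-wise independent family by
fully random sign functions (which are in particular 4-wise independent), uses 0-based indices,
and the name `exact_moments` is hypothetical.

```python
from fractions import Fraction
from itertools import product
from math import prod


def exact_moments(v, n, k):
    """Exact E[Y] and Var[Y] over fully random sign functions h_1, ..., h_k.

    v maps each index tuple in range(n)^k to a number.
    """
    tables = list(product((-1, 1), repeat=n))      # all functions [n] -> {-1, 1}
    indices = list(product(range(n), repeat=k))
    first = second = Fraction(0)
    for hs in product(tables, repeat=k):           # one sign table per coordinate
        sketch = sum(v[p] * prod(hs[j][p[j]] for j in range(k)) for p in indices)
        y = Fraction(sketch) ** 2
        first += y
        second += y * y
    count = len(tables) ** k
    mean = first / count
    return mean, second / count - mean ** 2
```

The enumeration visits $2^{nk}$ combinations of sign tables, so it is only practical for values
such as $n = 2$ and $k \le 3$. For the all-ones vector with $n = k = 2$ it yields $E[Y] = 4$ and
$\mathrm{Var}[Y] = 48$, within the bound $3^2 \cdot 4^2 = 144$.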

We remark that the bound on the variance in the above lemma is tight. One can verify that
when the vector $v$ is a uniform vector (i.e., all entries of $v$ are the same), the variance of
$Y$ is $\Omega(3^k E[Y]^2)$. With the above lemma, the following main theorem mentioned in the
introduction immediately follows by a standard argument presented in the proof of Theorem 3.4 in
the previous section.

Theorem 4.2. Let $v$ be a vector in $\mathbb{R}^{n^k}$ that maintains an arbitrary statistic of a
data stream of size $m$, in which every item is from $[n]^k$. Let $\epsilon, \delta \in (0, 1)$ be
real numbers. If there exists an algorithm that maintains an instance of $Y$ using
$O(\mu(n, m, k, \epsilon, \delta))$ memory bits, then there exists an algorithm $A$ such that:

(1) With probability $\ge 1 - \delta$ the algorithm $A$ outputs a value in
$[(1 - \epsilon)\|v\|^2, (1 + \epsilon)\|v\|^2]$, and

(2) the space complexity of $A$ is
$O(3^k \frac{1}{\epsilon^2} \log\frac{1}{\delta} \cdot \mu(n, m, k, \epsilon, \delta))$.

As discussed above, an immediate corollary is the existence of a one-pass space efficient
streaming algorithm to detect the dependency of $k$ random variables in $\ell_2$-norm:

Corollary 4.3. For every $\epsilon > 0$ and $\delta > 0$, there exists a randomized algorithm that
computes, given a sequence $a_1, \ldots, a_m$ of $k$-tuples, in one pass and using
$O(3^k \epsilon^{-2} \log\frac{1}{\delta} (\log m + \log n))$ memory bits, a number $Y$ so that the
probability $Y$ deviates from the square of the $\ell_2$ distance between product and joint
distribution by more than a factor of $(1 + \epsilon)$ is at most $\delta$.
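
One way to realize a single instance of $Y$ for $k$-tuples is to generalize the three counters of
Figure 1 to $k + 1$ counters, using the identity
$\sum_{\omega} v_\omega H(\omega) = \sum_{\omega} \Pr[\omega] H(\omega) - \prod_{i} \sum_{\omega_i} \Pr_i[\omega_i] h_i(\omega_i)$.
The sketch below follows that identity; it is an illustration under the assumption that the hash
functions are passed in as callables, and the name `k_dim_estimate` is hypothetical.

```python
from math import prod


def k_dim_estimate(stream, hashes):
    """One sample of Y for a stream of k-tuples; hashes[i] plays the role of h_i."""
    k = len(hashes)
    joint = 0            # running sum of H(a) = prod_i h_i(a^i)
    marginal = [0] * k   # running sums of h_i(a^i), one per coordinate
    m = 0
    for item in stream:
        signs = [hashes[i](item[i]) for i in range(k)]
        joint += prod(signs)
        for i in range(k):
            marginal[i] += signs[i]
        m += 1
    return (joint / m - prod(t / m for t in marginal)) ** 2
```

Each instance keeps $k + 1$ counters of $O(\log m)$ bits plus the descriptions of the $k$ hash
functions; the factor $3^k \epsilon^{-2} \log\frac{1}{\delta}$ in the corollary comes from the
number of independent instances that are combined.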

4.1. Analysis of the Sketch $Y$

This section is devoted to proving Lemma 4.1, where the main challenge is to bound the variance
of $Y$. The geometric approach of Indyk and McGregor [12] presented in Section 3 for the case of
$k = 2$ can be extended to analyze the general case. However, we remark that the generalization
requires new ideas. In particular, instead of performing "local analysis" that maps each rectangle
to its diagonals, a more complex "global analysis" is needed in higher dimensions to achieve the
desired bounds. The alternative proof we present here utilizes similar ideas, but relies on a more
combinatorial rather than geometric approach.

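For the expectation of $Y$, we have

$$\begin{aligned}
E[Y] &= E\Bigl[\sum_{\mathbf{p},\mathbf{q} \in [n]^k} v_{\mathbf{p}} v_{\mathbf{q}} H(\mathbf{p}) H(\mathbf{q})\Bigr] \\
&= \sum_{\mathbf{p} \in [n]^k} v_{\mathbf{p}}^2 \cdot E[H(\mathbf{p})^2] + \sum_{\mathbf{p} \ne \mathbf{q} \in [n]^k} v_{\mathbf{p}} v_{\mathbf{q}} E[H(\mathbf{p}) H(\mathbf{q})] \\
&= \sum_{\mathbf{p} \in [n]^k} v_{\mathbf{p}}^2 = \|v\|^2,
\end{aligned}$$

where the last equality follows by $H(\mathbf{p})^2 = 1$, and
$E[H(\mathbf{p}) H(\mathbf{q})] = 0$ for $\mathbf{p} \ne \mathbf{q}$.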

Now, let us start to prove $\mathrm{Var}[Y] \le 3^k E[Y]^2$. By definition,
$\mathrm{Var}[Y] = E[(Y - E[Y])^2]$, so we need to understand the following random variable:

$$\mathrm{Err} \equiv Y - E[Y] = \sum_{\mathbf{p} \ne \mathbf{q} \in [n]^k} H(\mathbf{p}) H(\mathbf{q}) v_{\mathbf{p}} v_{\mathbf{q}}. \tag{4.1}$$

The random variable Err is a sum of terms indexed by pairs
$(\mathbf{p}, \mathbf{q}) \in [n]^k \times [n]^k$ with $\mathbf{p} \ne \mathbf{q}$. At
a very high level, our analysis consists of two steps. In the first step, we group the terms in Err
properly and simplify the summation in each group. In the second step, we expand the square of
the sum in $\mathrm{Var}[Y] = E[\mathrm{Err}^2]$ according to the groups and apply the
Cauchy-Schwarz inequality three times to bound the variance.

We shall now gradually introduce the necessary notation for grouping the terms in Err and
simplifying the summation. We remind the reader that vectors over the reals (e.g.,
$v \in \mathbb{R}^{n^k}$) are denoted by $v, w, r$, and vectors over $[n]$ are denoted by
$\mathbf{p}, \mathbf{q}, \mathbf{a}, \mathbf{b}, \mathbf{c}, \mathbf{d}$ and referred to as index
vectors. We use $S \subseteq [k]$ to denote a subset of $[k]$, and let $\bar{S} = [k] \setminus S$.
We use $\mathrm{Ham}(\mathbf{p}, \mathbf{q})$ to denote the Hamming distance of index vectors
$\mathbf{p}, \mathbf{q} \in [n]^k$, i.e., the number of coordinates where $\mathbf{p}$ and
$\mathbf{q}$ are different.

Definition 4.4. (Projection and inverse projection) Let $\mathbf{c} \in [n]^k$ be an index vector
and $S \subseteq [k]$ a subset. We define the projection of $\mathbf{c}$ to $S$, denoted by
$\Phi_S(\mathbf{c}) \in [n]^{|S|}$, to be the vector $\mathbf{c}$ restricted to the coordinates in
$S$. Also, let $\mathbf{a} \in [n]^{|S|}$ and $\mathbf{b} \in [n]^{k-|S|}$ be index vectors. We
define the inverse projection of $\mathbf{a}$ and $\mathbf{b}$ with respect to $S$, denoted by
$\Phi_S^{-1}(\mathbf{a}, \mathbf{b}) \in [n]^k$, as the index vector $\mathbf{c} \in [n]^k$
such that $\Phi_S(\mathbf{c}) = \mathbf{a}$ and $\Phi_{\bar{S}}(\mathbf{c}) = \mathbf{b}$.
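
The two operations in this definition are simple to state in code. The following sketch uses
tuples for index vectors and 0-based coordinate sets; the names `project` and `inverse_project`
are illustrative assumptions.

```python
def project(c, S):
    """Phi_S(c): restrict index vector c to the coordinates in S (0-based)."""
    return tuple(c[i] for i in sorted(S))


def inverse_project(a, b, S, k):
    """Phi_S^{-1}(a, b): the c in [n]^k with Phi_S(c) = a and Phi_{S-bar}(c) = b."""
    inside = iter(a)     # entries destined for coordinates in S
    outside = iter(b)    # entries destined for the complement of S
    members = set(S)
    return tuple(next(inside) if i in members else next(outside) for i in range(k))
```

For example, with `c = (5, 6, 7, 8)` and `S = {0, 2}`, projecting to `S` and to its complement
gives `(5, 7)` and `(6, 8)`, and `inverse_project` reassembles them into `c`.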

We next define pair groups and use the definition to group the terms in Err. 

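Definition 4.5. (Pair Group) Let $S \subseteq [k]$ be a subset of size $|S| = t$. Let
$\mathbf{c}, \mathbf{d} \in [n]^t$ be a pair of index vectors with
$\mathrm{Ham}(\mathbf{c}, \mathbf{d}) = t$ (i.e., all coordinates of $\mathbf{c}$ and $\mathbf{d}$
are distinct). The pair group $\sigma_S(\mathbf{c}, \mathbf{d})$ is the set of pairs
$(\mathbf{p}, \mathbf{q}) \in [n]^k \times [n]^k$ such that (i) on the coordinates in $S$,
$\Phi_S(\mathbf{p}) = \mathbf{c}$ and $\Phi_S(\mathbf{q}) = \mathbf{d}$, and (ii) on the
coordinates in $\bar{S}$, $\mathbf{p}$ and $\mathbf{q}$ are the same, i.e.,
$\Phi_{\bar{S}}(\mathbf{p}) = \Phi_{\bar{S}}(\mathbf{q})$. Namely,

$$\sigma_S(\mathbf{c}, \mathbf{d}) = \{(\mathbf{p}, \mathbf{q}) \in [n]^k \times [n]^k : (\mathbf{c} = \Phi_S(\mathbf{p})) \wedge (\mathbf{d} = \Phi_S(\mathbf{q})) \wedge (\Phi_{\bar{S}}(\mathbf{p}) = \Phi_{\bar{S}}(\mathbf{q}))\}. \tag{4.2}$$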

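To give some intuition for the above definitions, we note that for every
$\mathbf{a} \in [n]^{|\bar{S}|}$, there is a unique pair
$(\mathbf{p}, \mathbf{q}) \in \sigma_S(\mathbf{c}, \mathbf{d})$ with
$\mathbf{a} = \Phi_{\bar{S}}(\mathbf{p}) = \Phi_{\bar{S}}(\mathbf{q})$, and so
$|\sigma_S(\mathbf{c}, \mathbf{d})| = n^{|\bar{S}|}$. On the other hand, for every pair
$(\mathbf{p}, \mathbf{q}) \in [n]^k \times [n]^k$ with $\mathbf{p} \ne \mathbf{q}$, there is a
unique non-empty $S \subseteq [k]$ such that $\mathbf{p}$ and $\mathbf{q}$ are distinct on exactly
the coordinates in $S$. Therefore, $(\mathbf{p}, \mathbf{q})$ belongs to exactly one pair
group $\sigma_S(\mathbf{c}, \mathbf{d})$. It follows that we can partition the summation in Err
according to the pair groups:

$$\mathrm{Err} = \sum_{S \subseteq [k]} \ \sum_{\substack{\mathbf{c},\mathbf{d} \in [n]^{|S|} \\ \mathrm{Ham}(\mathbf{c},\mathbf{d}) = |S|}} \ \sum_{(\mathbf{p},\mathbf{q}) \in \sigma_S(\mathbf{c},\mathbf{d})} H(\mathbf{p}) H(\mathbf{q}) v_{\mathbf{p}} v_{\mathbf{q}}. \tag{4.3}$$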
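
The partition into pair groups can be observed directly on a small domain. The sketch below
assigns every ordered pair $(\mathbf{p}, \mathbf{q})$ with $\mathbf{p} \ne \mathbf{q}$ to its key
$(S, \mathbf{c}, \mathbf{d})$; it uses 0-based coordinates, and the name `pair_groups` is a
hypothetical choice for illustration.

```python
from collections import defaultdict
from itertools import product


def pair_groups(n, k):
    """Group all ordered pairs (p, q), p != q, in range(n)^k by (S, c, d)."""
    groups = defaultdict(list)
    for p, q in product(product(range(n), repeat=k), repeat=2):
        S = tuple(i for i in range(k) if p[i] != q[i])   # coordinates that differ
        if not S:
            continue                                     # skip the pairs with p == q
        c = tuple(p[i] for i in S)
        d = tuple(q[i] for i in S)
        groups[(S, c, d)].append((p, q))
    return groups
```

Because each pair receives exactly one key, the groups are disjoint and cover all
$n^k (n^k - 1)$ ordered pairs, and every group with key $(S, \mathbf{c}, \mathbf{d})$ has
$n^{k - |S|}$ members.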

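We next observe that for any pair $(\mathbf{p}, \mathbf{q}) \in \sigma_S(\mathbf{c}, \mathbf{d})$,
since $\mathbf{p}$ and $\mathbf{q}$ agree on the coordinates in $\bar{S}$, the value of the product
$H(\mathbf{p}) H(\mathbf{q})$ depends only on $S$, $\mathbf{c}$ and $\mathbf{d}$. More precisely,

$$H(\mathbf{p}) H(\mathbf{q}) = \prod_{i \in [k]} h_i(p_i) h_i(q_i) = \Bigl(\prod_{i \in S} h_i(p_i) h_i(q_i)\Bigr) \cdot \Bigl(\prod_{i \in \bar{S}} h_i(p_i)^2\Bigr) = \prod_{i \in S} h_i(p_i) h_i(q_i),$$

which depends only on $S$, $\mathbf{c}$ and $\mathbf{d}$ since $\Phi_S(\mathbf{p}) = \mathbf{c}$ and
$\Phi_S(\mathbf{q}) = \mathbf{d}$. This motivates the definition of projected hashing.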

Definition 4.6. (Projected hashing) Let $S = \{s_1, s_2, \ldots, s_t\}$ be a subset of $[k]$, where
$s_1 < s_2 < \cdots < s_t$. Let $\mathbf{c} \in [n]^t$. We define the projected hashing
$H_S(\mathbf{c}) = \prod_{i \le t} h_{s_i}(c_i)$.

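We can now translate the random variable Err as follows:

$$\mathrm{Err} = \sum_{S \subseteq [k]} \ \sum_{\substack{\mathbf{c},\mathbf{d} \in [n]^{|S|} \\ \mathrm{Ham}(\mathbf{c},\mathbf{d}) = |S|}} H_S(\mathbf{c}) H_S(\mathbf{d}) \Bigl(\sum_{(\mathbf{p},\mathbf{q}) \in \sigma_S(\mathbf{c},\mathbf{d})} v_{\mathbf{p}} v_{\mathbf{q}}\Bigr). \tag{4.4}$$
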

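Fix a pair group $\sigma_S(\mathbf{c}, \mathbf{d})$; we next consider the sum
$\sum_{(\mathbf{p},\mathbf{q}) \in \sigma_S(\mathbf{c},\mathbf{d})} v_{\mathbf{p}} v_{\mathbf{q}}$.
Recall that for every $\mathbf{a} \in [n]^{|\bar{S}|}$, there is a unique pair
$(\mathbf{p}, \mathbf{q}) \in \sigma_S(\mathbf{c}, \mathbf{d})$ with
$\mathbf{a} = \Phi_{\bar{S}}(\mathbf{p}) = \Phi_{\bar{S}}(\mathbf{q})$. The sum can be
viewed as the inner product of two vectors of dimension $n^{|\bar{S}|}$ with entries indexed by
$\mathbf{a} \in [n]^{|\bar{S}|}$. To formalize this observation, we introduce the definition of
hyper-projection as follows.

Definition 4.7. (Hyper-projection) Let $v \in \mathbb{R}^{n^k}$, $S \subseteq [k]$, and
$\mathbf{c} \in [n]^{|S|}$. The hyper-projection $\Upsilon_{S,\mathbf{c}}(v)$ of $v$ (with respect
to $S$ and $\mathbf{c}$) is a vector $w = \Upsilon_{S,\mathbf{c}}(v)$ in $\mathbb{R}^{n^{k-|S|}}$
such that $w_{\mathbf{d}} = v_{\Phi_S^{-1}(\mathbf{c}, \mathbf{d})}$ for all
$\mathbf{d} \in [n]^{k-|S|}$.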

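Using the above definition, we continue to rewrite Err as

$$\mathrm{Err} = \sum_{S \subseteq [k]} \ \sum_{\substack{\mathbf{c},\mathbf{d} \in [n]^{|S|} \\ \mathrm{Ham}(\mathbf{c},\mathbf{d}) = |S|}} H_S(\mathbf{c}) H_S(\mathbf{d}) \cdot \langle \Upsilon_{S,\mathbf{c}}(v), \Upsilon_{S,\mathbf{d}}(v) \rangle. \tag{4.5}$$

Finally, we consider the product $H_S(\mathbf{c}) H_S(\mathbf{d})$ again and introduce the
following definition to further simplify Err.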

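Definition 4.8. (Similarity and dominance) Let $t$ be a positive integer.

• Two pairs of index vectors $(\mathbf{c}, \mathbf{d}) \in [n]^t \times [n]^t$ and
$(\mathbf{a}, \mathbf{b}) \in [n]^t \times [n]^t$ are similar if for all $i \in [t]$, the two sets
$\{c_i, d_i\}$ and $\{a_i, b_i\}$ are equal. We denote this as
$(\mathbf{a}, \mathbf{b}) \sim (\mathbf{c}, \mathbf{d})$.

• Let $\mathbf{c}$ and $\mathbf{d} \in [n]^t$ be two index vectors. We say $\mathbf{c}$ is
dominated by $\mathbf{d}$ if $c_i < d_i$ for all $i \in [t]$. We denote this as
$\mathbf{c} \prec \mathbf{d}$. Note that
$\mathbf{c} \prec \mathbf{d} \Rightarrow \mathrm{Ham}(\mathbf{c}, \mathbf{d}) = t$.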
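
Both notions are easy to enumerate for concrete vectors. The sketch below lists every pair similar
to a given $(\mathbf{c}, \mathbf{d})$ by choosing, in each coordinate, whether to keep or swap the
two entries; the names `similar_pairs` and `dominated` are hypothetical.

```python
from itertools import product


def similar_pairs(c, d):
    """All pairs (a, b) with {a_i, b_i} == {c_i, d_i} for every coordinate i."""
    choices = [{(ci, di), (di, ci)} for ci, di in zip(c, d)]   # keep or swap
    result = []
    for combo in product(*choices):
        a = tuple(x for x, _ in combo)
        b = tuple(y for _, y in combo)
        result.append((a, b))
    return result


def dominated(c, d):
    """True when c_i < d_i for every coordinate i, i.e. c is dominated by d."""
    return all(ci < di for ci, di in zip(c, d))
```

When all coordinates of $\mathbf{c}$ and $\mathbf{d}$ differ, the class has $2^t$ members, and
exactly one of them satisfies the dominance condition; that member serves as the representative of
the class.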

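Now, note that if $(\mathbf{a}, \mathbf{b}) \sim (\mathbf{c}, \mathbf{d})$, then
$H_S(\mathbf{a}) H_S(\mathbf{b}) = H_S(\mathbf{c}) H_S(\mathbf{d})$ since the value of the
product $H_S(\mathbf{c}) H_S(\mathbf{d})$ depends on the values $\{c_i, d_i\}$ only as a set. It is
also not hard to see that $\sim$ is an equivalence relation, and for every equivalence class
$[(\mathbf{a}, \mathbf{b})]$, there is a unique
$(\mathbf{c}, \mathbf{d}) \in [(\mathbf{a}, \mathbf{b})]$ with $\mathbf{c} \prec \mathbf{d}$.
Therefore, we can further rewrite Err as

$$\mathrm{Err} = \sum_{S \subseteq [k]} \ \sum_{\mathbf{c} \prec \mathbf{d} \in [n]^{|S|}} H_S(\mathbf{c}) H_S(\mathbf{d}) \cdot \Bigl(\sum_{(\mathbf{a},\mathbf{b}) \sim (\mathbf{c},\mathbf{d})} \langle \Upsilon_{S,\mathbf{a}}(v), \Upsilon_{S,\mathbf{b}}(v) \rangle\Bigr). \tag{4.6}$$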

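We are ready to bound the term $E[\mathrm{Err}^2]$ by expanding the square of the sum according to
Equation (4.6). We first show in Lemma 4.9 below that all the cross terms in the following
expansion vanish.

$$\begin{aligned}
\mathrm{Var}[Y] = \sum_{S, S' \subseteq [k]} \ \sum_{\substack{\mathbf{c} \prec \mathbf{d} \in [n]^{|S|} \\ \mathbf{c}' \prec \mathbf{d}' \in [n]^{|S'|}}} & E[H_S(\mathbf{c}) H_S(\mathbf{d}) H_{S'}(\mathbf{c}') H_{S'}(\mathbf{d}')] \cdot \Bigl(\sum_{(\mathbf{a},\mathbf{b}) \sim (\mathbf{c},\mathbf{d})} \langle \Upsilon_{S,\mathbf{a}}(v), \Upsilon_{S,\mathbf{b}}(v) \rangle\Bigr) \\
& \cdot \Bigl(\sum_{(\mathbf{a}',\mathbf{b}') \sim (\mathbf{c}',\mathbf{d}')} \langle \Upsilon_{S',\mathbf{a}'}(v), \Upsilon_{S',\mathbf{b}'}(v) \rangle\Bigr).
\end{aligned} \tag{4.7}$$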

Lemma 4.9. Let $S$ and $S'$ be subsets of $[k]$, and
$\mathbf{c} \prec \mathbf{d} \in [n]^{|S|}$ and $\mathbf{c}' \prec \mathbf{d}' \in [n]^{|S'|}$ index
vectors. We have
$E[H_S(\mathbf{c}) H_S(\mathbf{d}) H_{S'}(\mathbf{c}') H_{S'}(\mathbf{d}')] \in \{0, 1\}$.
Furthermore, we have
$E[H_S(\mathbf{c}) H_S(\mathbf{d}) H_{S'}(\mathbf{c}') H_{S'}(\mathbf{d}')] = 1$ iff
$(S = S') \wedge (\mathbf{c} = \mathbf{c}') \wedge (\mathbf{d} = \mathbf{d}')$.

Proof. Recall that $h_1, \ldots, h_k$ are independent copies of 4-wise independent uniform random
variables over $\{-1, 1\}$. Namely, for every $i \in [k]$, $h_i(1), \ldots, h_i(n)$ are 4-wise
independent, and $h_1(\cdot), \ldots, h_k(\cdot)$ are mutually independent. Observe that for every
$i \in [k]$, there are at most 4 terms out of $h_i(1), \ldots, h_i(n)$ appearing in the product
$H_S(\mathbf{c}) H_S(\mathbf{d}) H_{S'}(\mathbf{c}') H_{S'}(\mathbf{d}')$. It follows
that all distinct terms appearing in
$H_S(\mathbf{c}) H_S(\mathbf{d}) H_{S'}(\mathbf{c}') H_{S'}(\mathbf{d}')$ are mutually independent
uniform random variables over $\{-1, 1\}$. Therefore, the expectation is either 0, if there is some
$h_i(j)$ that appears an odd number of times, or 1, if all $h_i(j)$ appear an even number of times.
By inspection, the latter case happens if and only if
$(S = S') \wedge (\mathbf{c} = \mathbf{c}') \wedge (\mathbf{d} = \mathbf{d}')$. ■



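By the above lemma, Equation (4.7) is simplified to

$$\mathrm{Var}[Y] = \sum_{\substack{S \subseteq [k] \\ S \ne \emptyset}} \ \sum_{\mathbf{c} \prec \mathbf{d} \in [n]^{|S|}} \Bigl(\sum_{(\mathbf{a},\mathbf{b}) \sim (\mathbf{c},\mathbf{d})} \langle \Upsilon_{S,\mathbf{a}}(v), \Upsilon_{S,\mathbf{b}}(v) \rangle\Bigr)^2. \tag{4.8}$$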



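We next apply the Cauchy-Schwarz inequality three times to bound the above formula. Consider a
subset $S \subseteq [k]$ and a pair $\mathbf{c} \prec \mathbf{d} \in [n]^{|S|}$. Note that there are
precisely $2^{|S|}$ pairs $(\mathbf{a}, \mathbf{b})$ such that
$(\mathbf{a}, \mathbf{b}) \sim (\mathbf{c}, \mathbf{d})$. Thus, by the Cauchy-Schwarz inequality:

$$\begin{aligned}
\Bigl(\sum_{(\mathbf{a},\mathbf{b}) \sim (\mathbf{c},\mathbf{d})} \langle \Upsilon_{S,\mathbf{a}}(v), \Upsilon_{S,\mathbf{b}}(v) \rangle\Bigr)^2
&\le 2^{|S|} \sum_{(\mathbf{a},\mathbf{b}) \sim (\mathbf{c},\mathbf{d})} \bigl(\langle \Upsilon_{S,\mathbf{a}}(v), \Upsilon_{S,\mathbf{b}}(v) \rangle\bigr)^2 \\
&\le 2^{|S|} \sum_{(\mathbf{a},\mathbf{b}) \sim (\mathbf{c},\mathbf{d})} \langle \Upsilon_{S,\mathbf{a}}(v), \Upsilon_{S,\mathbf{a}}(v) \rangle \cdot \langle \Upsilon_{S,\mathbf{b}}(v), \Upsilon_{S,\mathbf{b}}(v) \rangle.
\end{aligned}$$
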

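Notice that in the second inequality, we applied Cauchy-Schwarz in a component-wise manner.
Next, for a subset $S \subseteq [k]$, we can apply the Cauchy-Schwarz inequality a third time (from
the third line to the fourth line) as follows:

$$\begin{aligned}
\sum_{\mathbf{c} \prec \mathbf{d} \in [n]^{|S|}} \Bigl(\sum_{(\mathbf{a},\mathbf{b}) \sim (\mathbf{c},\mathbf{d})} \langle \Upsilon_{S,\mathbf{a}}(v), \Upsilon_{S,\mathbf{b}}(v) \rangle\Bigr)^2
&\le 2^{|S|} \sum_{\mathbf{c} \prec \mathbf{d} \in [n]^{|S|}} \ \sum_{(\mathbf{a},\mathbf{b}) \sim (\mathbf{c},\mathbf{d})} \langle \Upsilon_{S,\mathbf{a}}(v), \Upsilon_{S,\mathbf{a}}(v) \rangle \cdot \langle \Upsilon_{S,\mathbf{b}}(v), \Upsilon_{S,\mathbf{b}}(v) \rangle \\
&= 2^{|S|} \sum_{\substack{\mathbf{c},\mathbf{d} \in [n]^{|S|} \\ \mathrm{Ham}(\mathbf{c},\mathbf{d}) = |S|}} \langle \Upsilon_{S,\mathbf{c}}(v), \Upsilon_{S,\mathbf{c}}(v) \rangle \cdot \langle \Upsilon_{S,\mathbf{d}}(v), \Upsilon_{S,\mathbf{d}}(v) \rangle \\
&\le 2^{|S|} \sum_{\mathbf{c},\mathbf{d} \in [n]^{|S|}} \langle \Upsilon_{S,\mathbf{c}}(v), \Upsilon_{S,\mathbf{c}}(v) \rangle \cdot \langle \Upsilon_{S,\mathbf{d}}(v), \Upsilon_{S,\mathbf{d}}(v) \rangle \\
&= 2^{|S|} \Bigl(\sum_{\mathbf{c} \in [n]^{|S|}} \langle \Upsilon_{S,\mathbf{c}}(v), \Upsilon_{S,\mathbf{c}}(v) \rangle\Bigr)^2.
\end{aligned}$$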

Finally, we note that by definition, we have
$\sum_{\mathbf{c} \in [n]^{|S|}} \langle \Upsilon_{S,\mathbf{c}}(v), \Upsilon_{S,\mathbf{c}}(v) \rangle = \|v\|^2$,
which equals $E[Y]$. It follows that the variance in Equation (4.8) can be bounded by

$$\mathrm{Var}[Y] \le \sum_{S \subseteq [k],\ S \ne \emptyset} 2^{|S|} \cdot E[Y]^2 = E[Y]^2 \sum_{i=1}^{k} \binom{k}{i} 2^i = (3^k - 1) E[Y]^2,$$

which finishes the proof of Lemma 4.1.
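
The counting identity behind the constant $3^k - 1$ can be confirmed numerically for small $k$.
The sketch below sums $2^{|S|}$ over all non-empty subsets explicitly and also evaluates the
binomial form; the function names are hypothetical.

```python
from itertools import combinations
from math import comb


def variance_constant(k):
    """Sum of 2^|S| over all non-empty subsets S of a k-element set."""
    return sum(2 ** len(S)
               for size in range(1, k + 1)
               for S in combinations(range(k), size))


def binomial_form(k):
    """The same sum grouped by subset size: sum_{i=1}^{k} C(k, i) 2^i."""
    return sum(comb(k, i) * 2 ** i for i in range(1, k + 1))
```

Both functions agree with $3^k - 1$, which is the binomial expansion of $(1 + 2)^k$ with the
$i = 0$ term removed.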
5. Conclusion 

There remain several open questions in this space. Lower bounds, particularly bounds that
depend non-trivially on the dimension $k$, would be useful. There may still be room for better
algorithms for testing $k$-wise independence in this manner using the $\ell_2$ norm. A natural
generalization would be to find a particularly efficient algorithm for testing
$k$-out-of-$s$-wise independence (other than handling each set of $k$ variables separately). More
generally, a question given in [12], to identify random variables whose correlation exceeds some
threshold according to some measure, remains widely open.